Study of Japanese Text Compression

Authors

  • Noriko Satoh
  • Takashi Morihara
  • Yoshiyuki Okada
Abstract

The Japanese language has several thousand distinct characters, and the character code length is 16 bits. In such documents the 8-bit units are interrelated. Conventional text compression employs 8-bit sampling because the compressed object is usually English text. We investigated compression schemes based on 16-bit sampling, expecting it to improve compression performance. In Japanese text, where words are short, statistical schemes with PPM [1] provide better compression ratios than sliding-dictionary schemes. We therefore investigated 16-bit sampling based on statistical schemes with a PPM model. We show that the 16-bit sampling scheme provides good compression ratios for short documents of under several tens of kilobytes, such as office reports. The processing speed is also better.

2. Algorithm

We investigated 16-bit sampling based on PPMC for documents that use 16-bit characters. In 16-bit sampling, where the sampling unit is the same as the character unit, a symbol can be encoded to a suitable bit length the first time it occurs, according to the number of distinct characters present. (In Japanese, the number of distinct characters is 7,000, so the suitable length is 13 bits.) For text data of under several kilobytes with many distinct characters, the compression ratio is greatly affected by the code lengths of the initial occurrences of symbols. Furthermore, enlarging the sampling unit reduces the total number of samples and the number of operations, but it increases the variety of symbols and lengthens the search time. We investigated how these two effects together influence processing time.

3. Experimental Results

We compared the sampling schemes using Japanese newspaper articles as sample data. The 16-bit scheme’s compression ratio is as good as that of the 8-bit scheme. In addition, encoding a symbol using 13 bits the first time it occurs improves the compression ratio by up to 10% (Fig. 1). The number of symbols in each context is basically the same except for the zero-byte context (Fig. 2), and the total number of stored symbols is reduced by half. The increase in the number of symbols in the zero-byte context does not affect the search process when a lookup table is used, so the search time per symbol stays the same. Thus the total number of samples is reduced by half, and the search processing is halved.

4. References

[1] T. Bell, I. H. Witten, and J. G. Cleary, “Modeling for Text Compression”, ACM Computing Surveys, Vol. 21, No. 4, pp. 557-591 (December 1989).
[2] Chi-Hung Chi, Chi-Kwun Kan, Kwok-Shing Cheng, and Ling Wong, “Extending Huffman Coding for Multilingual Text Compression”, Proc. of DCC95, p. 437.

[Figure residue: sampling unit (8-bit, 16-bit); context length]
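
The two quantitative points in the Algorithm section (a 13-bit first-occurrence code for 7,000 distinct characters, and half as many samples under 16-bit sampling) can be checked with a small sketch. This is only an illustration, not the authors' implementation: the helper names `initial_code_length` and `sample` are hypothetical, the sample string is made up, and UTF-16-BE is used here merely as a stand-in for a 16-bit character code (the abstract does not say which code was used).

```python
def initial_code_length(distinct_symbols: int) -> int:
    """Smallest fixed bit length that can distinguish `distinct_symbols` symbols."""
    if distinct_symbols < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return (distinct_symbols - 1).bit_length()


def sample(data: bytes, unit_bytes: int) -> list[bytes]:
    """Split `data` into fixed-width sampling units (1 byte = 8-bit, 2 bytes = 16-bit)."""
    if len(data) % unit_bytes:
        raise ValueError("data length is not a multiple of the sampling unit")
    return [data[i:i + unit_bytes] for i in range(0, len(data), unit_bytes)]


# Hypothetical sample text; UTF-16-BE is an assumption standing in for a 16-bit code.
text = "日本語のテキスト圧縮"
data = text.encode("utf-16-be")

units_8 = sample(data, 1)    # conventional 8-bit sampling
units_16 = sample(data, 2)   # 16-bit sampling, one unit per character

print(initial_code_length(7000))      # 13
print(len(units_8), len(units_16))    # 20 10
```

With one sampling unit per character, the model sees half as many symbols to encode, which is the source of the halved sample count described in the text; the cost is a larger symbol alphabet in the order-0 context.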

Similar resources

The Effect of Cadmium on the Ultrastructure and Metallothionein Levels in the Liver and Kidneys of Japanese quail

Background: The aim of this study was to use Japanese quail as an animal model to evaluate the effects of cadmium (Cd) on the ultrastructure and the activity of metallothionein (MT) in the liver and kidneys. Methods: One hundred male Japanese quails were randomly divided into two Cd and control groups in 2015. The first group received 100 ppm Cd for 60 days in their feed. The ultrastructural c...

Extending Huffman Coding for Multilingual Text Compression

Traditional text compression algorithms such as Huffman and LZ variants are usually based on 8-bit character sampling. However, under the Unicode representation for multilingual information, the character set of each language such as Chinese and Japanese consists of a very large number of distinct characters and thus 16-bit or 32-bit character sampling is needed. Consequently, when text compress...

Modality Expressions in Japanese and Their Automatic Paraphrasing

It is important for future NLP systems to formulate the semantic equivalence (and more generally, the semantic similarity) of natural language expressions. In particular, paraphrasing, full text information retrieval, example-based MT and document compression technology require the effective equivalence criterion for linguistic expressions. In this paper, first, we discuss the meaning of Japane...

Puretalk: a high quality Japanese text-to-speech system

This paper describes a high-quality Japanese text-to-speech (TTS) system, PureTalk. This system is similar to conventional diphone-based TTS using PSOLA, except that PureTalk employs the following novel techniques, which enable it to produce more intelligible and natural-sounding speech: 1) two-stage duration modeling based on a linear regression technique, 2) F0 contour modeling using polynomia...

Publication date: 1999